THE  TRANSFORMATION  OF  SENTENCES  FOR

INFORMATION  RETRIEVAL 

★ 

Jane  J.  Robinson

The  RAND  Corporation,  Santa  Monica,  California 


The  sentence  as  a  unit  of  language  stands  midway  be¬ 
tween  the  word  and  the  paragraph.  If  words  are  the  basic 
units  for  classification  and  indexing  and  paragraphs  the 
basic  units  for  abstracting,  the  sentence  is  the  basic 
unit  for  fact  retrieval. *  Very  simply,  the  central  prob¬ 
lem  of  fact  retrieval  is:  Given  an  interrogative  sentence, 
how  does  one  recognize  a  matching  sentence  that  supplies 
an  answer?  The  simplest  case  is  a  sentence  beginning  with 
an  interrogative  word  followed  by  a  string  of  additional 
words,  matched  by  a  sentence  which  replaces  the  interroga¬ 
tive  with  an  answering  word  or  phrase. 

Who  invented  the  flying  shuttle? 

John  Kay  invented  the  flying  shuttle. 
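The simplest case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the small set of interrogative words, and the tokenizer are hypothetical and are not part of any system described in this paper. The matcher succeeds only when the statement repeats the question's word string exactly, with an answering phrase in place of the interrogative.

```python
import re

# Interrogative words handled by this toy matcher (an assumption for illustration).
INTERROGATIVES = {"who", "what", "which"}

def tokens(sentence):
    """Lowercase a sentence and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def answer_simple(question, statement):
    """Return the answering phrase if the statement repeats the question's
    word string with the leading interrogative replaced; otherwise None."""
    q, s = tokens(question), tokens(statement)
    if not q or q[0] not in INTERROGATIVES:
        return None
    rest = q[1:]
    # The statement must end with exactly the words that follow the interrogative.
    if not rest or len(s) <= len(rest) or s[-len(rest):] != rest:
        return None
    return " ".join(s[:-len(rest)])
```

Any change of word order, or any added phrase at the end of the statement, makes this matcher return None, which is why the simple case is of so little practical use.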


Any  views  expressed  in  this  paper  are  those  of  the 
author.  They  should  not  be  interpreted  as  reflecting  the 
views  of  The  RAND  Corporation  or  the  official  opinion  or 
policy  of  any  of  its  governmental  or  private  research 
sponsors.  Papers  are  reproduced  by  The  RAND  Corporation 
as  a  courtesy  to  members  of  its  staff. 

This  paper  was  presented  at  the  1965  Congress  of 
International  Federation  for  Documentation  (FID),  Washington, 
D.C.,  October  1965.

*"Fact  retrieval"  is  not  a  well-defined  term;  "data
retrieval"  or  "text  retrieval"  are  substitutes.  All  that  is 
intended  here  is  to  distinguish  between  the  problem  of  pro¬ 
viding  references  ("document  retrieval")  and  the  problem  of 
providing  statements  within  documents  that  can  answer 
specific  questions.  The  problem  of  truth  is  something  else 
again  and  lies  outside  the  scope  of  syntactic  analysis  as 
treated  in  this  paper. 
If  all  cases  were  so  simple,  a  computer  could  be  programmed
to  find  the  matching  sentences  within  a  large  store
of  text  much  more  easily,  accurately,  and  completely  than 
humans  could.  As  usual,  the  simplest  cases  are  vanishingly 
rare,  and  so  far  no  computer  programs  can  cope  adequately 
with  the  shifting  word  orders  and  the  forms,  sometimes 
protean,  sometimes  elliptic,  that  sentences  in  natural 
language  texts  most  frequently  take. 

The  difficulty  is  that  the  basic  meanings  represented 
by  sentences  are  not  isomorphic  with  their  surface  forms, 
and  the  computer  can  deal  directly  only  with  forms.  In 
terms  of  meaning,  the  sentence

John  Kay  invented  the  flying  shuttle  in  1733.

is  the  matching  answer  for  both

By  whom  was  the  flying  shuttle  invented? 

and 

When  was  the  flying  shuttle  invented? 

But  these  examples  have  already  complicated  the  mechanical 
definition  of  the  procedures  for  recognizing  the  match.  It 
is  more  complicated  still  to  provide  for  mechanical  recog¬ 
nition  of  matches  within  sentence  boundaries,  where  the 
answer  is  contained  in  phrases  such  as:  "...  the  inventor
of  the  flying  shuttle,  Kay  ...  Kay's  invention
of  the  flying  shuttle  in  1733  ...,"  etc.

I  have  posed  the  problem  in  terms  of  finding  a  mechan¬ 
ical  procedure  for  question-answering,  not  primarily  to 
assess  the  state  of  the  art  of  automatic  language  proces¬ 
sing  for  information  retrieval,  but  because  these  terms 
make  more  clear  and  concrete  the  general  problem  of  recognizing
relevance  and  sameness  of  meaning  at  the  sentence
level  in  spite  of  formal  differences.  (The  related  prob¬ 
lems  of  synonymy  at  the  word  level  and  of  pronoun  reference 
across  sentence  boundaries  may  be  more  amenable  to  solu¬ 
tion  if  the  problems  of  sentence  structure  are  solved  first.) 
Heuristic  methods  for  dealing  with  any  single  paradigmatic
set  of  examples  of  the  sort  cited  above  are  possible,  but 
heuristic  or  ad  hoc  procedures  have,  so  far,  proved  in¬ 
adequate  to  deal  with  the  bewildering  variety  of  sentences 
in  natural  text.  We  need  a  general  procedure  firmly  grounded 
on  an  understanding  of  the  basic  processes  of  sentence  con¬ 
struction  provided  by  the  grammar  of  a  language.  We  cannot 
tell  a  computer  how  to  recognize  paraphrases  unless  we 
understand  how  we  ourselves  recognize  them. 

Of  course,  sameness  of  meaning  and  difference  of  form 
confront  us  all  the  time.  Our  universes  of  experience  and 
of  discourse  are  both  in  a  constant  state  of  flux  and  no  man 
ever  steps  into  the  same  river  or  says  the  same  thing  twice. 
The  river,  the  acoustics,  and  the  man  change  through  time. 

Yet  all  our  acquisition  and  organization  of  knowledge  rests 
on  our  perceiving  similarities  and  continuities,  in  spite 
of  objective  differences.  For  various  human  purposes,  we 
regard  different  items  of  experience  as  instances  of  the 
same  thing  and  we  judge  their  differences  to  be  irrelevant. 
Moreover,  we  can  communicate  our  knowledge  to  each  other 
with  ease  and  accuracy  only  to  the  extent  that  we  ourselves 
are  similar.  Only  through  shared  experience  and  shared 
conventions  can  we  speak  the  same  language  and  classify 
documents  in  the  same  terms. 
But  if  communication  implies  a  community  of  custom
and  of  language,  various  layers  in  that  community  can 
tolerate  varying  amounts  of  divergence  from  convention. 

So  long  as  individual  behavior  does  not  deviate  from  some 
basic  set  of  implicitly  defined  conventions,  eccentricities 
and  idiosyncrasies  are  tolerable.  In  some  spheres  of  behavior,
however,  we  must  eliminate  individual  differences
in  the  interests  of  communication.  A  detailed  account  of 
a  laboratory  experiment  is  written  for  the  most  part  in 
the  passive  voice:  the  centrifuge  was  operated  and  the 
amount  of  the  isotope  was  measured,  and  the  individual 
characteristics  of  the  operator  and  the  measurer  should
not  matter.  The  terms  of  the  scientist  are  more  rigorously 
defined  and  his  sentences  more  conventionally  constrained 
because  his  statements  and  descriptions  often  presuppose 
interchangeability  among  observers  and  experimenters. 

Thus,  attempts  to  automate  translation  from  one  language 
to  another  have  started  with  scientific  reports  rather 
than  with  poetry. 

So  also,  the  amount  of  tolerable  divergence  differs 
from  layer  to  layer  within  language.  If  we  are  to  com¬ 
municate  at  all,  the  conventions  of  language  are  most 
sharply  defined  and  restrictive  at  the  lowest  level — that 
of  the  basic  units,  the  phonemes,  and  the  letters.  New 
words  come  easily  into  our  vocabularies,  but  the  phonemes 
that  represent  them,  the  letters  that  in  turn  represent 
the  phonemes,  and  the  rules  for  combining  them  into  syl¬ 
lables,  change  with  glacial  slowness.  At  a  higher  level, 
the  vocabulary  appears  to  be  not  only  larger,  when  we 
compare  the  stock  of  morphemes  or  words  to  the  stock  of 
phonemes,  but  subject  to  more  rapid  change.  It  is  a  relatively
open  system.  But  the  new  words  are  for  the  most
part  nouns  and  verbs  like  astrogation,  astrogator,  and
astrogate,  whose  parts  are  familiar.  Furthermore,  the
most  frequent  words  of  our  vocabulary--the  pronouns,  the
prepositions,  the  auxiliaries,  etc.--show  little  alteration
through  time. 

The  number  of  letters  is  finite  and  small;  the  number 
of  words  or  morphemes  is  finite  though  large.  Given  an 
alphabet,  therefore,  a  computer  can  match  letters  and  words 
with  mechanical  regularity,  and  relieve  us  of  the  work  of 
making  indexes  and  concordances.  But  when  one  comes  to  the 
level  of  the  sentence,  the  possibilities  are  infinite. 
Setting  aside  those  instances  of  quotation  and  barring 
multiple  copies  of  the  same  document,  how  many  times  can 
one  expect  to  find  a  repetition  of  any  given  sentence  in
a  large  collection  of  documents?  If  "sentence"  is  defined 
as  any  stretch  of  words  between  one  mark  of  end  punctuation 
and  another,  the  probability  of  finding  a  repetition  is 
extremely  slight. 
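The contrast between the word level and the sentence level can be made concrete with a minimal sketch. The function names and the splitting rule are assumptions chosen for illustration; the sentence splitter simply applies the working definition above (a stretch of words between marks of end punctuation). Building a concordance over words is mechanical, whereas a search for repeated sentences in ordinary text usually returns nothing.

```python
import re
from collections import Counter, defaultdict

def split_sentences(text):
    """Split text after each mark of end punctuation (. ? !)."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]

def concordance(sentences):
    """Map each word to the indexes of the sentences in which it occurs."""
    index = defaultdict(list)
    for i, sentence in enumerate(sentences):
        # Use a set so a word repeated within one sentence is indexed once.
        for word in sorted(set(re.findall(r"[a-z']+", sentence.lower()))):
            index[word].append(i)
    return dict(index)

def repeated_sentences(sentences):
    """Return the sentences that occur more than once, with their counts."""
    counts = Counter(s.lower() for s in sentences)
    return {s: n for s, n in counts.items() if n > 1}
```

Applied to a short text about the flying shuttle, `concordance` finds "shuttle" in several sentences, while `repeated_sentences` comes back empty.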

The  reason  for  this  flowering  of  individuality  at  the 
sentence  level,  the  property  of  natural  languages  that  both 
provides  for  it  and  makes  it  tolerable  to  the  community,  has 
become  clearer  in  recent  years,  principally  through  the 
theoretical  work  in  linguistics  primarily  associated  with 
Chomsky  and  Harris  and  their  respective  schools  [1,2,3]. 
Briefly,  it  is  because  the  rules  for  sentence  construction 
are  recursive;  that  is,  a  basic  sentence  unit  or  "kernel" 
can  embed  within  itself  another  basic  unit,  which  can  embed 
another  in  turn,  and  so  on  ad  infinitum.  Some  embeddings 
are  obvious  at  the  surface,  as  in

Lewis  Paul  knew  that  John  Kay  invented  the 
flying  shuttle. 

More  often  they  are  transformed,  as  in 

Lewis  Paul  knew  John  Kay,  the  inventor  of  the 
flying  shuttle, 

or 

Lewis  Paul  knew  about  the  invention  of  the 
flying  shuttle  by  John  Kay. 

The  dependency  graphs  [4]  of  Fig.  1  show  how  the  underlying,
untransformed,  basic  structures  embedded  in  these
three  sentences  might  reasonably  be  represented. 
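The recursive property can be sketched as a data structure in which any argument of a kernel may itself be a kernel. The class and function names below are hypothetical, and the sketch models only the first of the three examples (the "that" clause). It does not reproduce Fig. 1 or the way the appositive and nominalized variants attach to their kernels.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Kernel:
    """A basic sentence unit; any argument may itself be a Kernel (recursion)."""
    verb: str
    subject: Union[str, "Kernel"]
    obj: Optional[Union[str, "Kernel"]] = None

def depth(node):
    """Count the levels of kernel embedding (a bare word or None counts as 0)."""
    if not isinstance(node, Kernel):
        return 0
    return 1 + max(depth(node.subject), depth(node.obj))

def kernels(node):
    """List every kernel in a structure, outermost first."""
    if not isinstance(node, Kernel):
        return []
    return [node] + kernels(node.subject) + kernels(node.obj)

# "Lewis Paul knew that John Kay invented the flying shuttle."
invent = Kernel("invent", "John Kay", "flying shuttle")
know = Kernel("know", "Lewis Paul", invent)
```

Because `obj` can hold another `Kernel`, nothing in the definition limits the depth of embedding, which is the sense in which the rules are recursive.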

These  graphs  exemplify  the  reduction  of  different 
surface  structures  with  the  same  basic  meaning  to  strongly 
similar,  embedded,  "canonical"  forms  representing  that 
meaning.  Such  a  reduction,  a  many-one  mapping  of  surface 
structures  onto  a  relatively  few  deep  structures,  suggests 
a  finite  "alphabet"  for  sentences,  roughly  analogous  to 
the  alphabet  for  words,  so  that  mechanical  matching  pro¬ 
cedures  for  meanings  through  the  matching  of  forms  can  be¬ 
come  feasible.  Even  if  mechanical  procedures  prove  im¬ 
practicable,  the  insights  gained  into  the  representation  of 
meaning,  especially  the  representation  of  the  "same"  mean¬ 
ing  in  formally  different  sentence  structures,  may  help  us 
devise  more  standardized  ways  of  storing  information  and 
constructing  data  bases  for  question-answering  or  deduc¬ 
tive  systems  in  information  retrieval. 
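A many-one mapping of this kind can be illustrated with a deliberately small sketch. The three surface patterns below are hypothetical and cover a single verb; a real transformational grammar would derive such correspondences rather than list them. The point is only that active, passive, and nominalized forms reduce to one canonical tuple, so that matching forms amounts to matching meanings.

```python
import re

YEAR = r"(?: in (?P<year>\d{4}))?$"

# Hypothetical surface patterns for one verb. The passive and nominal patterns
# are tried first because the active pattern would otherwise also match passives.
PATTERNS = [
    re.compile(r"^(?P<object>.+?) was invented by (?P<agent>.+?)" + YEAR),
    re.compile(r"^the invention of (?P<object>.+?) by (?P<agent>.+?)" + YEAR),
    re.compile(r"^(?P<agent>.+?) invented (?P<object>.+?)" + YEAR),
]

def canonical(surface):
    """Reduce a surface string to (verb, agent, object, year), or None if no pattern fits."""
    text = surface.strip().rstrip(".").lower()
    for pattern in PATTERNS:
        m = pattern.match(text)
        if m:
            return ("invent", m.group("agent"), m.group("object"), m.group("year"))
    return None
```

Once sentences and questions are both reduced in this way, a question such as "By whom was the flying shuttle invented?" needs only a lookup on the agent slot of the stored tuple.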

It  is  not  the  sentences,  but  their  kernels  that  appear 
to  be  the  units  for  representing  meaning.  One  important 
point  is  that  embedding  and  transformation  permit  the
construction  of  sentences  containing  many  basic  meanings, 
related  to  each  other  in  various  ways,  and  this  in  a  sense 
makes  sentences  more  efficient  storage  devices;  consequently
a  single  sentence  often  provides  answers  to  many
different  questions.  Also,  the  point  of  view  of  the 
questioner  need  not  be  strongly  similar  to  that  of  the 
writer  whose  sentence  contains  an  answer.  The  surface 
structure  of  the  writer's  sentence  can  reflect  some  of 
his  immediate  purposes  for  organizing  his  information, 
for  emphasizing  some  aspect  of  it  and  subordinating  others, 
as  well  as  his  individualities  of  style.  The  structure 
of  the  questioner's  interrogative  can  reflect  a  different 
immediate  purpose  and  a  different  style.  Communication 
is  still  possible,  because  the  deep  structures  of  their 
sentences  adhere  to  the  same  conventions. 

In  the  last  six  years,  several  research  groups  have 
attacked  the  problem  of  designing  automated  question- 
answering  systems  based  on  natural  text  rather  than  on 
highly  structured  data  bases,  and  various  techniques  for 
combining  syntactic  and  semantic  analyses  have  been  used 
[5,6,7].  The  view  adopted  here  is  that  semantic  (and
other)  techniques  will  prove  more  effective  if  applied 
after  a  syntactic  analysis  that  explicates  the  deep  struc¬ 
tures.  That,  of  course,  depends  upon  the  development  of 
detailed  transformational  grammars. 

This  view  is  borne  out  by  the  difficulty  encountered 
by  current  automated  parsing  grammars  assigning  structural 
descriptions  directly  to  sentences.  Applied  heuristically,
they  miss  valid  structural  assignments  that  correctly
correlate  an  expression  with  equivalent  paraphrases  and
relevant  questions  [8,9].  Applied  algorithmically,  they
produce  an  unmanageable  number  of  parsings,  and  a  surpris¬ 
ing  proportion  of  them  correspond  to  possible  ambiguities 
and  thus  are  not  eliminable.  Many,  if  not  most,  of  these 
ambiguities  arise  because  the  transformation  of  embedded 
sentences  may  lead  to  constructional  homonymity  on  the 
surface  of  the  sentence,  as  in  the  famous  "Flying  planes 
can  be  dangerous."  Moreover,  the  necessary  co-occurrence 
rules  become  unmanageably  numerous  if  written  for  all 
surface  structures  rather  than  for  the  smaller  set  of  deep 
structures.

One  avenue  to  be  explored  is  to  subject  each  of  these 
multiple  analyses  of  a  sentence  produced  by  a  loosely  constructed
"surface"  grammar  to  inverse  transformations,  comparing
the  results  with  a  tightly  constructed  grammar  to
find  the  simpler  deep  structures  from  which  any  valid 
surface  structure  must  be  derived.  For  example,  the  sur¬ 
face  grammar  would,  typically,  produce  two  analyses  for 
"John  was  drunk  by  midnight":  one  would  label  it  a  passive, 
corresponding  to  "Midnight  drank  John."  Comparison  of 
this  inversely  transformed  kernel  with  the  requirements  of
a  precise  deep  grammar,  however,  should  reveal  the  presence 
of  co-occurrence  restrictions  on  inanimate  "time"  nouns  with 
verbs  like  "eat"  and  "drink,"  which  require  animate  subjects. 
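The filtering step described above can be sketched as follows. Everything here is an assumption made for illustration: the toy lexicon, the feature assignments, and the single passive pattern are hypothetical and stand in for a loose surface grammar and a precise deep grammar. The sketch undoes a hypothesized passive and then tests the resulting kernel against a co-occurrence restriction on the subject.

```python
import re

# Toy lexicon; the entries and features are illustrative assumptions.
ANIMATE = {"john", "mary"}
PAST_PARTICIPLE = {"drunk": "drink", "eaten": "eat"}
REQUIRES_ANIMATE_SUBJECT = {"drink", "eat"}

def inverse_passive(sentence):
    """Undo a hypothesized passive 'X was V-en by Y', giving (subject, verb, object)."""
    m = re.match(r"^(\w+) was (\w+) by (\w+)\.?$", sentence.strip().lower())
    if not m or m.group(2) not in PAST_PARTICIPLE:
        return None
    patient, participle, agent = m.groups()
    return (agent, PAST_PARTICIPLE[participle], patient)

def satisfies_cooccurrence(kernel):
    """Check the deep-structure restriction that certain verbs need animate subjects."""
    subject, verb, _ = kernel
    return verb not in REQUIRES_ANIMATE_SUBJECT or subject in ANIMATE
```

For "John was drunk by midnight," the inverse transformation yields the kernel (midnight, drink, john), which fails the restriction, so the passive reading is discarded. A sentence such as "Milk was drunk by Mary" passes the same test.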

The  major  linguistic  task,  then,  is  to  provide  detailed, 
analytic,  recognition  grammars  with  transformational  com¬ 
ponents  adequate  to  deal  with  the  complexities  of  the  sur¬ 
face  structures  of  natural  sentences  if  the  necessarily 
ad  hoc  but  ultimately  unsatisfactory  simplifying  assumptions 
of  current  question-answering  systems  are  to  be  superseded.
Until  quite  recently,  transformational  grammars
have  been  written  to  generate  rather  than  to  analyze,
although  as  early  as  1961  Matthews  [10]  proposed  a  technique
for  analyzing  a  given  sentence  by  synthesis  from  a
generative  grammar.

Work  on  the  recognition  problem  is  now  underway,  and 
three  different  types  of  grammar  are  evolving  with  trans¬ 
formational  components  designed  to  recover  deep  structures 
automatically.  Kuno  [11]  reports  some  experiments  with 
the  Harvard  predictive  analyzer  to  produce  kernel  sentences 
concurrently  with  the  analysis  of  surface  structures. 
Petrick  [12],  Kay,  and  the  MITRE  Language  Processing  Techniques
Subdepartment  [13]  have  all  proposed  methods  applicable
to  phrase  structure  grammars.  An  "approximate"
formalism  to  obtain  structural  descriptions  similar  to 
deep  structures  is  being  developed  by  Lieberman,  et  al.,
at  IBM  [14].  Although  applied  to  a  phrase  structure
grammar  now,  the  formalism  is  intended  to  be  applicable 
to  other  models  as  well.  Robinson  experimented  briefly 
with  a  paraphrasing  routine  for  a  phrase  structure  grammar 
[9],  but  is  currently  designing  a  dependency  grammar  with 
transformations,  in  collaboration  with  Hays  and  Kay.

Several  machine  translation  groups  are  also  incor¬ 
porating  transformational  features  into  their  grammars, 
in  accord  with  Harris’  assumption  that  many  languages  are 
more  similar  in  their  kernel  sentences  than  in  their  total 
surface  structure.  Linguistic  work  in  translation  is 
obviously  an  important  part  of  information  retrieval,  but 
can  only  be  mentioned  here. 




It  would  be  unrealistic  to  suppose  that  practical 
programs  for  automating  the  retrieval  of  information  ex¬ 
pressed  in  natural  text  will  be  forthcoming  in  the  next 
few  years.  Experience  with  machine  translation  has  shown 
that  to  extrapolate  from  progress  made  in  early  stages 
with  simpler  patterns  of  natural  languages  can  lead  all 
too  easily  to  speciously  optimistic  predictions  of  early 
success.  Nevertheless,  a  cautious  optimism  can  be  based 
upon  certain  signs.  Detailed  knowledge  about  the  languages 
is  accumulating.  At  the  same  time,  the  capacity  of  com¬ 
puters  to  handle  masses  of  non-numerical  information  is 
increasing  and  the  interaction  between  man  and  machines  is
becoming  easier  as  well  as  faster.  Most  promising  of  all, 
from  a  linguist's  point  of  view,  is  the  development  of  a 
theoretical  framework  in  linguistics  within  which  can  be 
fitted  the  description  of  the  covariance  of  form  and  mean¬ 
ing  at  the  syntactic  level,  extending  beyond  the  morpheme 
and  the  word  and  into  the  sentence,  where  propositions  are 
stated  and  interrogated. 